Vectorized hash grouping by a single text column #7586
base: main
Conversation
Use UMASH hashes, which come with a guaranteed bound on the collision probability, as the hash table keys.
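For context, a minimal sketch of the idea, assuming the umash_fprint()/struct umash_fp fingerprint API as described in the UMASH README (this is not necessarily how the PR wires it up):

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#include "umash.h"

/*
 * Sketch: use a UMASH fingerprint (two independent 64-bit hashes) of the
 * text value as the hash table key instead of the text itself. With the
 * documented collision bound, distinct keys are overwhelmingly unlikely to
 * share a fingerprint, so the hot path can compare fingerprints only.
 */
typedef struct TextGroupEntry
{
	struct umash_fp fingerprint; /* 128-bit fingerprint used as the key */
	/* ... per-group aggregate state would go here ... */
} TextGroupEntry;

/* Assumed to be derived once at startup, e.g. with umash_params_derive(). */
static struct umash_params umash_params_global;

static inline struct umash_fp
text_key_fingerprint(const char *data, size_t len)
{
	/* The seed just has to be the same for every row of one hash table. */
	return umash_fprint(&umash_params_global, /* seed = */ 0, data, len);
}

static inline bool
fingerprint_equal(struct umash_fp a, struct umash_fp b)
{
	return a.hash[0] == b.hash[0] && a.hash[1] == b.hash[1];
}
```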
Codecov Report
Attention: Patch coverage is
Additional details and impacted files

@@            Coverage Diff             @@
##             main    #7586      +/-   ##
==========================================
+ Coverage   80.06%   81.83%     +1.76%
==========================================
  Files         190      243        +53
  Lines       37181    44828      +7647
  Branches     9450    11184      +1734
==========================================
+ Hits        29770    36686      +6916
- Misses       2997     3732       +735
+ Partials     4414     4410         -4

☔ View full report in Codecov by Sentry.
LGTM. I added some suggestions for minor improvements. Also have some general questions.
typedef struct BytesView
{
	const uint8 *data;
	uint32 len;
} BytesView;
Instead of defining a new type, can we use StringInfo and initReadOnlyStringInfo? Just an idea.
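For reference, roughly what the StringInfo variant would look like; a sketch assuming a PostgreSQL version that has initReadOnlyStringInfo(), which just points a StringInfoData at an existing buffer without copying:

```c
#include "postgres.h"
#include "lib/stringinfo.h"

/*
 * Sketch: a read-only StringInfo over the key bytes instead of a custom
 * BytesView. No allocation or copy happens here; the StringInfoData only
 * stores the pointer and length.
 */
static inline StringInfoData
make_key_view(const uint8 *data, uint32 len)
{
	StringInfoData view;

	/* Cast needed because StringInfoData stores a non-const char pointer. */
	initReadOnlyStringInfo(&view, (char *) data, (int) len);
	return view;
}
```

The trade-off raised in the reply below is that StringInfoData also carries maxlen and cursor fields that a plain pointer-plus-length struct does not need.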
Well, it's not a complex type, and StringInfo has some unrelated things, so I'd keep it as is.
BytesView *restrict output_key = (BytesView *) output_key_ptr;
HASH_TABLE_KEY_TYPE *restrict hash_table_key = (HASH_TABLE_KEY_TYPE *) hash_table_key_ptr;

if (unlikely(params.single_grouping_column.decompression_type == DT_Scalar))
Question for my own understanding (not asking for any changes in this PR):
This is deep into vector aggregation, so I would expect that this function only gets passed an arrow array. But now we need to check for different non-array formats, including impossible cases (e.g., DT_Iterator). If we only passed in arrays, these checks would not be necessary.
The arrow array format already supports everything we need. Even scalar/segmentby values can be represented by arrow arrays (e.g., run-end encoded).
Now we need this extra code to check for different formats/cases everywhere we reach into the data. Some of them shouldn't even be possible here.
IMO, the API to retrieve a value should be something like:
Datum d = arrow_array_get_value_at(array, rownum, &isnull, &valuelen);
This function can easily check the encoding of the array (dict, run-end, etc.) to retrieve the requested value.
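A rough sketch of the kind of accessor meant here. The names below are illustrative only (the reviewer's arrow_array_get_value_at() is hypothetical, and the ExampleArray layout is made up, not the real ArrowArray):

```c
#include "postgres.h"

typedef enum ExampleEncoding
{
	ENCODING_PLAIN,
	ENCODING_DICTIONARY,
	ENCODING_RUN_END
} ExampleEncoding;

typedef struct ExampleArray
{
	ExampleEncoding encoding;
	const uint64 *validity_bitmap; /* NULL means all rows are valid */
	const uint32 *offsets;         /* n + 1 offsets into 'bytes' */
	const uint8 *bytes;
	const int16 *dict_indexes;     /* only for ENCODING_DICTIONARY */
} ExampleArray;

/* Single entry point that hides the encoding from the caller. */
static Datum
example_array_get_value_at(const ExampleArray *array, int rownum, bool *isnull, int *valuelen)
{
	int physical_row = rownum;

	/* Map the logical row to the physical row for indirect encodings. */
	if (array->encoding == ENCODING_DICTIONARY)
		physical_row = array->dict_indexes[rownum];
	/* Run-end encoding would be handled similarly; omitted for brevity. */

	*isnull = array->validity_bitmap != NULL &&
			  (array->validity_bitmap[rownum / 64] & (UINT64CONST(1) << (rownum % 64))) == 0;
	if (*isnull)
		return (Datum) 0;

	*valuelen = array->offsets[physical_row + 1] - array->offsets[physical_row];
	return PointerGetDatum(&array->bytes[array->offsets[physical_row]]);
}
```

The per-row branch on the encoding is exactly the cost the reply below is concerned about.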
Yeah, I have already come to regret supporting the scalar values throughout aggregation. Technically it should perform better, because it avoids creating, e.g., an arrow array with the same constant value for every row, and sometimes we perform the computations in a different way for scalar values. But the implementation complexity might be a little too high. Maybe I should look into removing this, at least for the key columns, and always materializing them into arrow arrays. I'm going to consider this after we merge the multi-column aggregation.
The external interface might still turn out to be more complex than what you suggest, and closer to the current CompressedColumnValues, because sometimes we have to statically generate the function that works with, e.g., dictionary encoding specifically, and that won't be possible if we determine the encoding inside an opaque callback. We can't call an opaque callback (i.e. a non-inlinable dynamic function) for every row, because that produces significantly less performant code.
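To illustrate the point about static specialization versus an opaque callback, here is a sketch reusing the ExampleArray type from the earlier sketch; all function names, including hash_and_find_group(), are made up:

```c
/*
 * (a) Specialized for dictionary encoding: the accessor code is visible to
 * the compiler at this call site, so it can be inlined and the whole loop
 * optimized (and possibly vectorized).
 */
static void
fill_group_offsets_dict(const ExampleArray *array, int nrows, uint32 *restrict group_offsets)
{
	for (int row = 0; row < nrows; row++)
	{
		const int16 idx = array->dict_indexes[row];
		const uint8 *data = &array->bytes[array->offsets[idx]];
		const uint32 len = array->offsets[idx + 1] - array->offsets[idx];

		group_offsets[row] = hash_and_find_group(data, len); /* hypothetical helper */
	}
}

/*
 * (b) Generic version with a per-row callback: get_key() is an opaque
 * dynamic call that the compiler cannot inline, which defeats most loop
 * optimizations in the hot path.
 */
typedef void (*get_key_fn)(const ExampleArray *array, int row, const uint8 **data, uint32 *len);

static void
fill_group_offsets_generic(const ExampleArray *array, int nrows, get_key_fn get_key,
						   uint32 *restrict group_offsets)
{
	for (int row = 0; row < nrows; row++)
	{
		const uint8 *data;
		uint32 len;

		get_key(array, row, &data, &len);
		group_offsets[row] = hash_and_find_group(data, len);
	}
}
```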
const int total_bytes = output_key.len + VARHDRSZ;
text *restrict stored = (text *) MemoryContextAlloc(policy->hashing.key_body_mctx, total_bytes);
SET_VARSIZE(stored, total_bytes);
memcpy(VARDATA(stored), output_key.data, output_key.len);
Suggest making use of a PostgreSQL builtin function. Replace

const int total_bytes = output_key.len + VARHDRSZ;
text *restrict stored = (text *) MemoryContextAlloc(policy->hashing.key_body_mctx, total_bytes);
SET_VARSIZE(stored, total_bytes);
memcpy(VARDATA(stored), output_key.data, output_key.len);

with

MemoryContext oldmcxt = MemoryContextSwitchTo(policy->hashing.key_body_mctx);
text *stored = cstring_to_text_with_len((const char *) output_key.data, output_key.len);
MemoryContextSwitchTo(oldmcxt);
I think it's better to have something that can be inlined here, because it's a part of a hot loop that builds the hash table.
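For what it's worth, the open-coded version could still be factored into a static inline helper without losing the inlining benefit (a sketch, not part of the PR; the helper name is made up, and cstring_to_text_with_len() lives in varlena.c, so it cannot be inlined here without LTO):

```c
#include <string.h>

#include "postgres.h"

/* Hypothetical inlinable wrapper around the open-coded varlena construction. */
static inline text *
store_text_key(MemoryContext mctx, const uint8 *data, uint32 len)
{
	const int total_bytes = len + VARHDRSZ;
	text *stored = (text *) MemoryContextAlloc(mctx, total_bytes);

	SET_VARSIZE(stored, total_bytes);
	memcpy(VARDATA(stored), data, len);
	return stored;
}
```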
@@ -59,7 +59,7 @@ jobs:
   build_type: ${{ fromJson(needs.config.outputs.build_type) }}
   ignores: ["chunk_adaptive metadata telemetry"]
   tsl_ignores: ["compression_algos"]
-  tsl_skips: ["bgw_db_scheduler bgw_db_scheduler_fixed"]
+  tsl_skips: ["vector_agg_text vector_agg_groupagg bgw_db_scheduler bgw_db_scheduler_fixed"]
Why do we need to skip these tests on Windows? Is it also because of UMASH?
Right, I didn't get it to compile there, so I decided to disable it for now.
Co-authored-by: Erik Nordström <[email protected]>
Signed-off-by: Alexander Kuzmenkov <[email protected]>
Up to 70% improvement in tsbench: https://grafana.ops.savannah-dev.timescale.com/d/fasYic_4z/compare-akuzm?orgId=1&var-branch=All&var-run1=4078&var-run2=4080&var-threshold=0.02&var-use_historical_thresholds=true&var-threshold_expression=2%20%2A%20percentile_cont%280.90%29&var-exact_suite_version=false&from=now-2d&to=now